This document explores a dataset containing information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area during February 2019.
# Import all necessary libraries
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
# Load dataset into a pandas dataframe and show the last 5 rows
bike_data = pd.read_csv('201902-fordgobike-tripdata.csv')
bike_data.tail()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 183407 | 480 | 2019-02-01 00:04:49.7240 | 2019-02-01 00:12:50.0340 | 27.0 | Beale St at Harrison St | 37.788059 | -122.391865 | 324.0 | Union Square (Powell St at Post St) | 37.788300 | -122.408531 | 4832 | Subscriber | 1996.0 | Male | No |
| 183408 | 313 | 2019-02-01 00:05:34.7440 | 2019-02-01 00:10:48.5020 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 66.0 | 3rd St at Townsend St | 37.778742 | -122.392741 | 4960 | Subscriber | 1984.0 | Male | No |
| 183409 | 141 | 2019-02-01 00:06:05.5490 | 2019-02-01 00:08:27.2200 | 278.0 | The Alameda at Bush St | 37.331932 | -121.904888 | 277.0 | Morrison Ave at Julian St | 37.333658 | -121.908586 | 3824 | Subscriber | 1990.0 | Male | Yes |
| 183410 | 139 | 2019-02-01 00:05:34.3600 | 2019-02-01 00:07:54.2870 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 216.0 | San Pablo Ave at 27th St | 37.817827 | -122.275698 | 5095 | Subscriber | 1988.0 | Male | No |
| 183411 | 271 | 2019-02-01 00:00:20.6360 | 2019-02-01 00:04:52.0580 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 37.0 | 2nd St at Folsom St | 37.785000 | -122.395936 | 1057 | Subscriber | 1989.0 | Male | No |
# Checking for incorrect datatypes and missing values
bike_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
The dataset contains 183,412 rows and 16 columns documenting individual rides in the Ford GoBike system during February 2019. At load time the columns are stored as integer, float and object (string) data types.
I am most interested in the various characteristics of people who use bikes from the Ford GoBike System.
I will use the ride duration in seconds, the latitude and longitude of both the start and end stations, the user type, the member birth year, the member gender and the bike_share_for_all_trip flag to conduct my analysis.
Checking for data issues and cleaning accordingly.
# Make a copy of the original dataset
bike_copy = bike_data.copy()
# Checking for missing data
bike_copy.isna().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
# Drop rows that contain missing data
bike_copy.dropna(inplace=True)
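Dropping every row with a missing value is simple, but it is worth checking how much data it costs. The following is a minimal sketch on a toy frame; the values and the helper name `dropna_report` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def dropna_report(df):
    """Return (rows_before, rows_after, fraction_dropped) for a full dropna()."""
    before = len(df)
    after = len(df.dropna())
    return before, after, (before - after) / before


# Toy data standing in for the trip table (the missing values are made up).
toy = pd.DataFrame({
    "duration_sec": [480, 313, 141, 139],
    "start_station_id": [27.0, np.nan, 278.0, 220.0],
    "member_birth_year": [1996.0, 1984.0, np.nan, 1988.0],
})

before, after, frac = dropna_report(toy)
print(f"{before} rows -> {after} rows ({frac:.0%} dropped)")
```

On the real data, the row counts reported by info() before and after cleaning (183,412 and 174,952) correspond to 8,460 dropped rows, about 4.6% of the dataset.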
# Checking for duplicates in data
bike_copy.duplicated().sum()
0
bike_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  object
 2   end_time                 174952 non-null  object
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64
 12  user_type                174952 non-null  object
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object
 15  bike_share_for_all_trip  174952 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.7+ MB
# Checking for the unique values in the gender columns
bike_copy.member_gender.value_counts()
Male      130500
Female     40805
Other       3647
Name: member_gender, dtype: int64
# Checking for the unique values in the user_type column
bike_copy.user_type.value_counts()
Subscriber    158386
Customer       16566
Name: user_type, dtype: int64
# Checking for the unique values in the bike_share_for_all_trip
bike_copy.bike_share_for_all_trip.value_counts()
No     157606
Yes     17346
Name: bike_share_for_all_trip, dtype: int64
# convert member_gender, user_type, and bike_share_for_all_trip into ordered categorical types
ordinal_var_dict = {'member_gender': ['Male', 'Female', 'Other'],
                    'user_type': ['Subscriber', 'Customer'],
                    'bike_share_for_all_trip': ['No', 'Yes']}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    bike_copy[var] = bike_copy[var].astype(ordered_var)
# Converting start_time and end_time columns to datetime columns
bike_copy['start_time'] = pd.to_datetime(bike_copy['start_time'])
bike_copy['end_time'] = pd.to_datetime(bike_copy['end_time'])
# Changing the data type of the below columns from float to integers
id_col = ['start_station_id', 'end_station_id', 'bike_id', 'member_birth_year']
bike_copy[id_col] = bike_copy[id_col].astype('int64')
# Calculating ages of bike riders using the member_birth_year column
# Note: ages are relative to the year the notebook is run, not the 2019 ride year
bike_copy['member_age'] = (datetime.datetime.now().year - bike_copy['member_birth_year']).astype('int64')
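Because `datetime.datetime.now().year` changes with the run date, the computed ages drift each year the notebook is re-run. One hedged alternative is to measure age at the time of the ride, using the year of `start_time`. The sketch below uses a toy frame, and the column name `age_at_ride` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy rows standing in for the trip table (birth years taken from rows shown in this notebook).
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-01 00:04:49", "2019-02-28 23:54:18"]),
    "member_birth_year": [1996, 1974],
})

# Age in the year the ride took place, independent of when the code is run.
toy["age_at_ride"] = toy["start_time"].dt.year - toy["member_birth_year"]
print(toy)
```

With this definition the result is reproducible, at the cost of every age being a few years lower than the values shown in this notebook's outputs.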
bike_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  int64
 4   start_station_name       174952 non-null  object
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  int64
 8   end_station_name         174952 non-null  object
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64
 12  user_type                174952 non-null  category
 13  member_birth_year        174952 non-null  int64
 14  member_gender            174952 non-null  category
 15  bike_share_for_all_trip  174952 non-null  category
 16  member_age               174952 non-null  int64
dtypes: category(3), datetime64[ns](2), float64(4), int64(6), object(2)
memory usage: 20.5+ MB
# Checking the summary statistics of the dataset
bike_copy.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | member_age |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 704.002744 | 139.002126 | 37.771220 | -122.351760 | 136.604486 | 37.771414 | -122.351335 | 4482.587555 | 1984.803135 | 37.196865 |
| std | 1642.204905 | 111.648819 | 0.100391 | 0.117732 | 111.335635 | 0.100295 | 0.117294 | 1659.195937 | 10.118731 | 10.118731 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 | 21.000000 |
| 25% | 323.000000 | 47.000000 | 37.770407 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3799.000000 | 1980.000000 | 30.000000 |
| 50% | 510.000000 | 104.000000 | 37.780760 | -122.398279 | 101.000000 | 37.781010 | -122.397437 | 4960.000000 | 1987.000000 | 35.000000 |
| 75% | 789.000000 | 239.000000 | 37.797320 | -122.283093 | 238.000000 | 37.797673 | -122.286533 | 5505.000000 | 1992.000000 | 42.000000 |
| max | 84548.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 | 144.000000 |
Now that the data has been cleaned, let's start exploring.
What is the distribution of riders based on gender?
# Using seaborn's countplot to see the distribution of bikers based on gender
base_color = sns.color_palette()[0]
sns.countplot(x='member_gender', data=bike_copy, color=base_color)
plt.title('Distribution of Members Based On Gender');
# Using pandas plot function
(bike_copy.member_gender.value_counts(normalize=True)*100).plot(kind='bar')
plt.xticks(rotation=0)
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.title('Percentage of Members Based On Gender');
(bike_copy.member_gender.value_counts(normalize=True)*100)
Male      74.591888
Female    23.323540
Other      2.084572
Name: member_gender, dtype: float64
A large majority of bikers are male: males make up about 75% of the member population, females about 23% and other genders about 2%.
Which user type rides bikes more?
# Using seaborn countplot function to view the distribution of user types
sns.countplot(x='user_type', data=bike_copy, color=base_color)
plt.title('Distribution of Type Of Bike Users');
Subscribers make far more use of the bikes than customers, accounting for about 90% of rides.
How likely are bikers to share a gobike during a trip?
# Using seaborn countplot function to view the distribution of bike sharing
sns.countplot(x='bike_share_for_all_trip', data=bike_copy, color=base_color);
Most bike users (about 90% of rides) do not share bikes during a trip.
# The duration_sec and member_age columns have huge outliers that could affect our analysis
# Subsetting for rides of at most 4000 seconds and member ages of at most 70
bike_copy = bike_copy[(bike_copy['duration_sec'] <= 4000) & (bike_copy['member_age'] <= 70)].reset_index(drop=True)
bike_copy.describe()
bike_copy.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | member_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974 | Male | Yes | 48 |
| 1 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959 | Male | No | 63 |
| 2 | 1147 | 2019-02-28 23:55:35.104 | 2019-03-01 00:14:42.588 | 300 | Palm St at Willow St | 37.317298 | -121.884995 | 312 | San Jose Diridon Station | 37.329732 | -121.901782 | 3803 | Subscriber | 1983 | Female | No | 39 |
| 3 | 1615 | 2019-02-28 23:41:06.766 | 2019-03-01 00:08:02.756 | 10 | Washington St at Kearny St | 37.795393 | -122.404770 | 127 | Valencia St at 21st St | 37.756708 | -122.421025 | 6329 | Subscriber | 1989 | Male | No | 33 |
| 4 | 1570 | 2019-02-28 23:41:48.790 | 2019-03-01 00:07:59.715 | 10 | Washington St at Kearny St | 37.795393 | -122.404770 | 127 | Valencia St at 21st St | 37.756708 | -122.421025 | 6548 | Subscriber | 1988 | Other | No | 34 |
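The cut-offs of 4000 seconds and 70 years used in the filtering step were chosen by eye. A common rule-based alternative is Tukey's upper fence, Q3 + 1.5 × IQR. The sketch below is only an illustration of that rule, not the method used in this notebook; the helper name `iqr_upper_fence` is an assumption, and the toy series is simply the five-number summary of `duration_sec` from the describe() table.

```python
import pandas as pd


def iqr_upper_fence(series, k=1.5):
    """Upper Tukey fence: Q3 + k * (Q3 - Q1)."""
    q1, q3 = series.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)


# min, 25%, 50%, 75% and max of duration_sec before filtering.
toy = pd.Series([61, 323, 510, 789, 84548], name="duration_sec")

fence = iqr_upper_fence(toy)
kept = toy[toy <= fence]
print(fence, len(kept))
```

With quartiles of 323 and 789 seconds the fence falls well below 4000 seconds, so this rule would discard noticeably more rides than the manual threshold; which one is preferable depends on how much of the long tail should stay in the analysis.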
What is the age range of Ford GoBike members?
bin_edges = np.arange(20, bike_copy['member_age'].max()+5, 5)
plt.hist(data=bike_copy, x='member_age', bins=bin_edges)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution Of Ages Of Bike Users');
# Using a boxplot to view the statistical distribution and check for outliers
sns.boxplot(x='member_age', data=bike_copy);
# Looking at distribution of ages between 20 and 70
bin_edges = np.arange(20, bike_copy['member_age'].max()+3, 3)
plt.hist(data=bike_copy, x='member_age', bins=bin_edges)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution Of Ages Of Bike Users');
While riders' ages range from about 20 to 70 after filtering, a large percentage of bikers fall into the 25 - 45 age range.
What period of time do most bikers spend cycling?
# Viewing the distribution of duration in seconds
sns.displot(x='duration_sec', data=bike_copy);
# Using a histogram with a log scale on x axis to visualize
ticks = [3, 10, 30, 100, 300, 1000, 3000]
labels = ['{}'.format(x) for x in ticks]
plt.figure(figsize=[8, 5])
bin_edges = np.arange(0, bike_copy.duration_sec.max()+20, 20)
plt.hist(x='duration_sec', data=bike_copy, bins=bin_edges)
plt.xscale('log')
plt.xlabel('Duration (Seconds)')
plt.ylabel('Frequency')
plt.xticks(ticks, labels)
plt.xlim(30, 5000);
Although most riders prefer to take short trips on the bikes, some rides in the unfiltered data last for hours (the longest is 84,548 seconds, about 23.5 hours). This will be investigated in further analysis.
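When a histogram is drawn on a log-scaled x axis, bins of constant width in seconds appear progressively narrower towards the right. A hedged alternative is to space the bin edges evenly in log units. The sketch below uses synthetic durations (the log-normal parameters are made up) and the range 50 to 5000 seconds is an assumption chosen to cover the filtered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly log-normal ride durations in seconds (illustrative only).
durations = rng.lognormal(mean=np.log(510), sigma=0.7, size=5000)
durations = durations[(durations >= 61) & (durations <= 4000)]

# 40 bins spanning two decades: every bin covers 0.05 units of log10(seconds).
bin_edges = np.geomspace(50, 5000, num=41)
counts, edges = np.histogram(durations, bins=bin_edges)

# The same edges can be passed to matplotlib:
#   plt.hist(durations, bins=bin_edges); plt.xscale('log')
print(counts.sum(), len(durations))
```

Because consecutive edges share a constant ratio, every bar has the same visual width on the log axis, which makes the shape of the distribution easier to judge.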
Are some bike stations more frequented than others?
# Adjust bin edges to show true distribution of start station id
bin_edges = np.arange(0, bike_copy.start_station_id.max()+5, 5)
plt.figure(figsize=[8, 5])
plt.hist(x='start_station_id', data=bike_copy, bins=bin_edges)
plt.xlabel('station_id')
plt.ylabel('Frequency')
plt.title('Distribution of start station id');
# Same for end station. Seems like both plots look alike
plt.figure(figsize=[8, 5])
plt.hist(x='end_station_id', data=bike_copy, bins=bin_edges)
plt.xlabel('station_id')
plt.ylabel('Frequency')
plt.title('Distribution of end station id');
The distributions of start and end station ids look similar, and both are uneven, which suggests that some stations are more frequented than others.
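Station ids are labels rather than quantities, so a histogram over ids groups unrelated stations into the same bin. Counting rides per station is a more direct way to see which stations are busiest. This is a minimal sketch on a toy frame; the station names come from rows shown in this notebook, but the number of rides per station is made up.

```python
import pandas as pd

# Toy start stations (ride counts are illustrative, not real).
toy = pd.DataFrame({
    "start_station_name": [
        "Beale St at Harrison St",
        "Spear St at Folsom St",
        "Beale St at Harrison St",
        "The Alameda at Bush St",
        "Beale St at Harrison St",
        "Spear St at Folsom St",
    ],
})

# value_counts() sorts stations from most to least frequent.
rides_per_station = toy["start_station_name"].value_counts()
top = rides_per_station.head(2)
print(top)
```

On the full dataset, the same two lines applied to `bike_copy['start_station_name']` would list the busiest start stations by name.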
From the analysis above, there is a gender imbalance in the data: about 75% of riders are male.
A large percentage of bike riders are subscribers, and most riders do not share rides during a trip.
Member age and ride duration had extreme outliers that could distort the analysis, so rides longer than 4000 seconds and members older than 70 were dropped. The histograms of start and end station ids are similar in shape, which suggests that some bike stations are more frequented or more accessible than others. I also applied a log scale to the x axis of the duration_sec histogram, under which ride durations look approximately normally distributed.
This section investigates relationships between pairs of variables introduced in the univariate exploration above.
How is ride duration on Ford GoBikes distributed across the month of February?
# Using the pandas .plot method to plot a time series of start_time vs duration_sec
fig, ax = plt.subplots(figsize=(15, 6));
bike_copy.plot(x='start_time', y='duration_sec', ax=ax, legend=False)
plt.xlabel('Start Time')
plt.ylabel('Duration (seconds)')
plt.title('Time Plot Series of Duration Spent On Bikes');
The time series plot above shows a periodic pattern with high spikes, indicating that at specific times of day bikers take very long rides, while at other times there is little or no use of the bikes.
Which categories of customers take the longest rides?
# Let's plot all three categorical groups together
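The periodic pattern described above can be checked numerically by grouping rides by the hour of day in which they start. The following is a sketch on toy data; the timestamps and durations are made up for illustration.

```python
import pandas as pd

# Toy rides (illustrative values only).
toy = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 08:05:00", "2019-02-01 08:40:00",
        "2019-02-01 17:15:00", "2019-02-02 08:20:00",
        "2019-02-02 03:10:00",
    ]),
    "duration_sec": [400, 600, 900, 500, 1200],
})

# Number of rides and median duration for each starting hour (0-23).
by_hour = (
    toy.groupby(toy["start_time"].dt.hour)["duration_sec"]
       .agg(["count", "median"])
)
print(by_hour)
```

A table like this separates two effects that the line plot mixes together: how many rides start in a given hour and how long those rides typically last.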
fig, ax = plt.subplots(nrows = 3, figsize=[8, 12])
sns.boxplot(x='member_gender', y='duration_sec', data=bike_copy, color=base_color, ax=ax[0])
sns.boxplot(x='user_type', y='duration_sec', data=bike_copy, color=base_color, ax=ax[1])
sns.boxplot(x='bike_share_for_all_trip', y='duration_sec', data=bike_copy, color=base_color, ax=ax[2]);
The boxplots above show that female and other-gender riders typically take slightly longer rides than male riders, even though males make up the largest share of riders. Likewise, customers have a higher typical ride duration than subscribers, who make up most of the data. The imbalance between the categories may contribute to these differences.
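Boxplots show medians rather than means, so it can help to compute both alongside the group sizes when comparing imbalanced categories. This is a minimal sketch on a toy frame with made-up durations.

```python
import pandas as pd

# Toy rides: four subscriber rides and two customer rides (values are illustrative).
toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 2,
    "duration_sec": [300, 500, 400, 600, 900, 1100],
})

# Group size, median and mean duration per user type.
summary = toy.groupby("user_type")["duration_sec"].agg(["count", "median", "mean"])
print(summary)
```

Reporting `count` next to the averages makes the class imbalance visible, so a difference based on a small group is not over-interpreted.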
bike_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172728 entries, 0 to 172727
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             172728 non-null  int64
 1   start_time               172728 non-null  datetime64[ns]
 2   end_time                 172728 non-null  datetime64[ns]
 3   start_station_id         172728 non-null  int64
 4   start_station_name       172728 non-null  object
 5   start_station_latitude   172728 non-null  float64
 6   start_station_longitude  172728 non-null  float64
 7   end_station_id           172728 non-null  int64
 8   end_station_name         172728 non-null  object
 9   end_station_latitude     172728 non-null  float64
 10  end_station_longitude    172728 non-null  float64
 11  bike_id                  172728 non-null  int64
 12  user_type                172728 non-null  category
 13  member_birth_year        172728 non-null  int64
 14  member_gender            172728 non-null  category
 15  bike_share_for_all_trip  172728 non-null  category
 16  member_age               172728 non-null  int64
dtypes: category(3), datetime64[ns](2), float64(4), int64(6), object(2)
memory usage: 18.9+ MB
Let's look at a PairGrid and a heatmap of the numerical columns in the dataset to look for correlations between variables.
# Listing out all numerical columns
numerical_col = ['duration_sec', 'start_station_id', 'end_station_id', 'bike_id', 'member_birth_year', 'member_age']
correlation = bike_copy[numerical_col].corr()
sns.heatmap(correlation, cmap = 'vlag_r');
# plot matrix: sample 1000 entries
samples = np.random.choice(bike_copy.shape[0], 1000, replace = False)
bike_sample = bike_copy.loc[samples,:]
g = sns.PairGrid(data = bike_sample, vars = ['duration_sec', 'start_station_id', 'end_station_id', 'bike_id', 'member_birth_year', 'member_age'])
g = g.map_diag(plt.hist, bins = 20)
g.map_offdiag(plt.scatter, alpha = 0.1);
There seems to be little or no correlation between most pairs of columns, apart from member birth year and member age, which is expected because age is computed directly from birth year. There is also some correlation between start station id and end station id.
The boxplots above show that female and other-gender riders typically take slightly longer rides than male riders, even though males make up the largest share of riders. Likewise, customers have a higher typical ride duration than subscribers, who make up most of the population. The imbalance between the categories may contribute to these differences.
In the time series plot above, ride duration follows a periodic pattern: at specific times of day bikers take very long rides, while at other times there is little or no use of the bikes.
This section uses plots of three or more variables to investigate the data further, building on the findings from the previous sections.
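The relationship between birth year and age is exact by construction: age is a constant minus birth year, so the Pearson correlation is -1. The sketch below checks this on the five birth years shown in the head() output, assuming the reference year 2022 that is consistent with the ages in that table (for example 1974 and 48).

```python
import pandas as pd

# Birth years from the first five rows of the filtered data.
toy = pd.DataFrame({"member_birth_year": [1974, 1959, 1983, 1989, 1988]})

# Assumption: 2022 is the reference year implied by the ages in the notebook output.
toy["member_age"] = 2022 - toy["member_birth_year"]

r = toy["member_birth_year"].corr(toy["member_age"])
print(round(r, 6))
```

Since the two columns carry the same information, one of them could be left out of the heatmap and PairGrid without losing anything.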
What is the average time spent on bikes by each gender type and does it vary based on whether they are subscribers or customers?
# creating a function to plot multivariate plots
def multiplot(data, x, hue):
    """Plot a bar chart of mean duration_sec for each level of x, split by hue."""
    sns.barplot(data=data, x=x, y='duration_sec', hue=hue)
    plt.xlabel('{}'.format(x))
    plt.ylabel('Duration (sec)')
multiplot(bike_copy, 'member_gender', 'user_type')
Based on user type, subscribers spend less time on bikes than customers. Among subscribers, riders of other gender spend the most time on bikes, followed by females and then males. Among customers, females spend the most time on bikes.
What is the average time spent on bikes by each user type and does it vary based on whether they share bikes for all trips?
# Using the multiplot function created earlier
multiplot(bike_copy, 'user_type', 'bike_share_for_all_trip')
Interestingly, no customers shared their bikes while on a trip. This may be because the bike sharing feature is not available to non-subscribers, but this is only speculation, as we don't have any information on it.
Let's use the start and end station latitudes, longitudes and ids to create a scatter mapbox that shows us the areas where bikers start and end their rides.
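The observation that no customers appear in the "Yes" group can be confirmed with a contingency table of the two categorical columns. This is a sketch on a toy frame whose rows are made up to mimic that pattern.

```python
import pandas as pd

# Toy rides (illustrative): only subscribers ever have the flag set to "Yes".
toy = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"],
    "bike_share_for_all_trip": ["Yes", "No", "No", "No", "No"],
})

# Rows are user types, columns are the flag values, cells are ride counts.
table = pd.crosstab(toy["user_type"], toy["bike_share_for_all_trip"])
print(table)
```

A zero in the Customer/Yes cell of the real table would confirm what the bar plot suggests, without relying on reading bar heights.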
# using plotly library to create a mapbox
fig = px.scatter_mapbox(
    bike_copy,  # Our DataFrame
    lat='start_station_latitude',
    lon='start_station_longitude',
    width=800,  # Width of map
    height=600,  # Height of map
    color='start_station_id',
    hover_data=["start_station_id"],  # Display the start station id when hovering mouse
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
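Plotting one marker per ride draws many overlapping points at each station. A hedged alternative is to aggregate to one row per start station first and carry the ride count along. The sketch below uses the five rows from the head() output; the column name `rides` is an assumption introduced for illustration.

```python
import pandas as pd

# Start stations of the first five filtered rides (station 10 appears twice).
toy = pd.DataFrame({
    "start_station_id": [7, 93, 300, 10, 10],
    "start_station_latitude": [37.804562, 37.770407, 37.317298, 37.795393, 37.795393],
    "start_station_longitude": [-122.271738, -122.391198, -121.884995, -122.404770, -122.404770],
})

# One row per station, with the number of rides that started there.
stations = (
    toy.groupby(["start_station_id", "start_station_latitude", "start_station_longitude"])
       .size()
       .reset_index(name="rides")
)
print(stations)
```

A frame like `stations` could then be passed to `px.scatter_mapbox` in place of `bike_copy`, for example with `size='rides'`, so that marker size reflects how busy each station is.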